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~ Abstract 

' To assess how future progress in gravitational microlensing computation at high optical depth will rely on both hardware and 

^ software solutions, we compare a direct inverse ray-shooting code implemented on a graphics processing unit (GPU) with both a 

k>^ widely-used hierarchical tree code on a single-core CPU, and a recent implementation of a parallel tree code suitable for a CPU- 

^5 based cluster supercomputer. We examine the accuracy of the tree codes through comparison with a direct code over a much wider 

^j range of parameter space than has been feasible before. We demonstrate that all three codes present comparable accuracy, and 

-y-. choice of approach depends on considerations relating to the scale and nature of the microlensing problem under investigation. 

pg On current hardware, there is little difference in the processing speed of the single-core CPU tree code and the GPU direct code, 

however the recent plateau in single-core CPU speeds means the existing tree code is no longer able to take advantage of Moore's 

I l' law-like increases in processing speed. Instead, we anticipate a rapid increase in GPU capabilities in the next few years, which 

^j is advantageous to the direct code. We suggest that progress in other areas of astrophysical computation may benefit from a 

*~~J transition to GPUs through the use of "brute force" algorithms, rather than attempting to port the current best solution directly to a 

,-C GPU language - for certain classes of problems, the simple implementation on GPUs may already be no worse than an optimised 



CU 



single-core CPU version. 
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1. Introduction 

Gravitational microlensing is the deflection of light by mat- 
ter in the regime where multiple-imaging and high magni- 
fication events occur, but the angular separation of images 
is too small to resolve. The observable effect is a change 
in brightness of a source over a period of time, with the 
details of the light curve depending on the complexity of 
the lens distribution and the surface brightness profile of 
the source. Gravitational microlensing has been used to 
study high magnification events on local scales, viz. the 
searches for dark matter compact objects in the Galactic 
bulge and halo (see for example references in and citations to 
Alcocket al.l fl 993: AubourgetalJll993l llJdalski et~ai][l993: 



Calchi Novati et al.1 12002; Wvrzvkowski et al. 2009) a nd ex 



trasolar planets ( e.g. iGould & Loeblll992l; iBond et al.1 12004 ; 



<(e 
10) 



Sum i et al.l 120101) , and quasar microlens i ng on cosmologi- 



cal scales (e.g. Vanderrie st et al.l 1989t llrwin et al.l Il989 ; 



KochaneklEooi iBate et alJJ2008t bai et al.lboiol). Fo r recent 



reviews of micr o lensin g on al l scales, see Wambsganss (2006), 
Kochaneketail d2007l) . iGouldl d2009l) . iMaol d2008l) and lGaudil 
d2010h . 



Microlensing on local scales occurs in the low optical depth 
regime, where the source is affected by at most a few mi- 
crolenses along the line of sight. Large numbers of stars must 
therefore be monitored in order to observe a single microlens- 
ing event. Conversely, the microlensing optical depth at the 
location of lensed quasar images is likely to be ~ 1 . This mi- 
crolensing signal is therefore more complex, as it is the result of 
a large num ber of microlenses a cting on the source essentially 
all the time d Wambsganssi 120061) . In the remainder of this work, 
we focus on high optical depth microlensing. 

A key element in many microlensing calculations is the de- 
termination of the magnification map; a typical example is 
shown in Figure Q] This map provides an estimate of the point- 
source brightening over a finite region of a two-dimensional 
source plane. The magnification of an extended source is found 
by convolving a source profile (intensity and geometry) with the 
magnification map. This approach has been used to probe the 
structure of quasar emission regions, including accretion discs 
(e.g. lKochanekll2004l:lEig"enbrod et al] 2008: Fl ovd et al I J2009) 
and broad emissio n line regions (e.g. iLewis & Ib ata 2004; 
Abajas et alJl2007t) . 
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The limitation in many microlensing calculations is the com- 
putational overhead in producing magnification maps with suf- 
ficient resolution and high (statistical) accuracy. Fitting ob- 
served light-curves obtained over long periods of time requires 
large maps with high pixel resolution to resolve small sources 
(such as quasar X-ray emission regions). Similarly, fitting 
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Figure 1 : A sample microlensing magnification map for a mi- 
crolensing model with the following parameters: total conver- 
gence o" = 0.4, shear y - 0.0, N v \ x = 1024 2 pixels in the source 
plane and N rays = 2000 light rays per source pixel (on average). 
These parameters correspond to N* - 558 lenses. The source 
plane has a side length of 20 Einstein radii. The greyscale cov- 
ers the range from low magnification (black) to high magnifica- 
tion (white). 



multi-wavelength observations requires both large maps and 
high pixel resolution in order to simultaneously model emission 
from small optical emission regions and larger infrared emis- 
sion regions. The situation is even worse when attempts are 
made to simultaneously model microlensing of accretion discs 
and the much larger broad emission line regions, a difference 
in scale of 2-3 orders of magnitude. Finally, many maps must 
be generated in order to ensure statistical independence of sim- 
ulated measurements. It is only reasonable to sample any one 
magnification map as many times as you can place a source on 
it without overlapping. 

In this paper we compare three techniques for computing 
magnification maps: a hierarchical tree code for single-core 
CPUs; a parallel, large data tree code for use on a CPU-based 
cluster supercomputer; and a direct, inverse ray-shooting code 
for use on graphics processing units (GPUs). We consider the 
accuracy of the codes, to determine whether the approximations 
inherent in the tree code are valid over a wider range of param- 
eter space than has been feasible to test before, and compare 
run-times between the single-core tree code and the GPU code 
over a range of astrophy sic ally-motivated parameters. While 
the "brute force" approach to ray shooting has not been con- 
sidered a feasible option on CPU architectures, we show that 
the GPU implementation is highly competitive, and is indeed 
faster than the tree code in many circumstances. The expected 
increase in GPU speeds in the years ahead bodes well for ad- 
vances in gravitational microlensing simulations. Our work 
hints at the potential benefit to other fields of astrophysical com- 



putation from the implementation of simple, "brute force" al- 
gorithms as a means of accelerating the adoption of GPUs in 
astronomy for time-consuming computations. 

The remainder of this paper is set out as follows. In Section 
[2] we describe the process of generating magnification maps, 
and summarise key features of three gravitational microlensing 
codes which can be used for this task. In Section [3] we present 
a series of code comparisons (accuracy and speed). In Section 
|4] we discuss the relative merits of hierarchical and direct ray- 
shooting approaches, and suggest the classes of problems to 
which each is best-suited. We present our conclusions in Sec- 
tion 

2. Generating magnification maps 

In general, magnification maps are calculated using the (two- 
dimensional) gravitational lens equation 



qt(x), 



(1) 



which relates the locations of the source, y, and an image, x, via 
the deflection angle, a(x). The magnification is: 



H - 1/detA 



(2) 



where A = dy/dx is the Jacobian matrix of equation {JJ. 

Equation ([TJ gives a one-to-one relationship between an 
image location and the corresponding source location (x — > 
y), but a one-to-many relationship for the alternative case of 
mapping a source location to its multiple images (y — > x). 
While analytic sol utions exist for a s mall number of simple 
lens models (e.g. ISchneider et al.lll992l) . in general a numer- 
ical approach is used to calculate /i over a finite region of 
the source plane. Thi s can be achieve d most easily using 



inverse ray-shootin g (Kavs er et al.l Il986t ISchneider & Weiss 
1986| ISchneider & Weissl Il987l) . Here, light rays are prop- 



agated backwards from the observer, through the lens plane, 
where they are deflected using equation ([TJ, and then mapped 
to a grid of pixels in the source plane. This approach avoids 
the need to determine the location of every image. By counting 
the number of light rays reaching each source plane pixel, N/j, 
compared to the average per pixel if there was no lensing, N rays , 
an estimate of the per-pixel magnification, yu (i , is obtained: 



*' ij i *rays- 



(3) 



For cosmological microlensing problems, the appropriate 
model for the deflection angle describes N t compact lens 
masses in a smooth background mass distribution, acted on by 
an external (macro-mo del) shear y. This mode l, an extension 
of the one proposed by Chang & Refsdall (I1979I) . is suitable for 
considering the impact of lensing due to individual stars where 
the external shear is due to the large-scale galactic mass distri- 
bution. The lens equation in this case is: 



y = ( V i! r ) x -^ x -|> 



(x - x,) 



(4) 



where the convergence cr — cr c + cr* combines contributions 
from smooth matter cr c , and compact objects cr*. This is the 
model for the deflection of light by a screen of mass that we 
will use throughout this paper. 

The relationship between cr, and the number of lenses N* is: 

cr,A 



N,. 



n{M) 



(5) 



If the deflecting mass distribution were smoothed out, a circle in 
the image plane would map onto an ellipse with major-to-minor 
axis ratio |1 — cr + y|/|l - cr — y\ in the lens plane. The area of 
this 'minimum' ellipse is set by the size of the image plane. In 
practice, the graininess of the matter distribution can cause rays 
to be scattered outside the receiving area. Similarly, rays from 
a long way outside the shooting region can be scattered into 
the receiving square in the source plane. For that reason, the 
lenses in the codes considered here are distributed in a circle of 
area A with a diameter larger than the long axis of the minimum 
ellipse. 

The number of calculations, <£, required to obtain a mangifi- 
cation map with A^ p ; x pixels scales as: 



O 



Nflop x N pix x N t x N mys , 



(6) 



where A^flop is the number of floating point operations for each 
deflection calculation, and N rdys is chosen to achieve a desired 
level of accuracy. It was quickly realised that the direct ray- 
shooting approach was not practical on single CPUs: a sce- 



nario, (D 



fiducial 



3 x 10 lb , with JVpi x = 2500 z pixels, N T 



■ays 



500, 
N, — 10 6 lenses^ and JVflqp = 1 would take months to years 
to calculate (iWambsganssl 1 19991) on hardware that was avail- 
able at the time. 

Wambsganss (1990; 1999) introduced a m ethod based on a 
hierarchical tree code ([Barnes & Hun, Il986l) . Here, the con- 
tribution of each lens depends on its distance from the light 
ray - lenses that are further away are grouped together into 
pseudo-lenses with a higher mass. This approach reduces 
N t , thus increasing the speed of the overall calculation, but 
at the cost of a slight error in the magnification map due 
to the approximation. This approach has been used to con- 
strain both the s ize and temperature p rofile of quasar accre- 
tion discs (e.g. [W ambsganss, Paczvnski, & Schneider 1990; 
Rauch & Blandfordl ll99u""lMorgan et al. 1 h " n ' i 



2007; Keeton et al. 2006; Anguita et al 



2006; Poolevetal. 
2008; Eigenbrod et al. 



20081 iBate et al.ll2008l:lFlovd et al.ll2009 ), dark matter fraction s 



in the lensing galaxies (e.g. IPoolev et al .1120091: iDai et al.l2010h . 
and possibly eve n the orientation of the quasar with respect to 
the line of sight dPoindexter & KochanekluOlOl) . 

Despite the much-improved speed compared to direct ray- 
shooting, the hierarchical method is complex to implement, and 
requires a certain amount of expert knowledge. These facts 
make slight modifications or additions to the code quite chal- 
lenging. Parameters such as the required accuracy (which de- 
termines whether to use a particular lens cell, or its four sub- 
cells, based on the distance between the ray and the cell's cen- 
tre of mass) must be determined experimentally by repeatedly 



running the code until acceptable results are obtained. Ad- 
ditionally, the generation of the tree data structure imposes a 
memory overhead which exceeds typical single CPU memory 
sizes for large N*. A parallel hierarchical tree code, suitable for 
running billion-lens calc ulations, has now been developed by 
Garsden & Le wis (2010). This approach overcomes the single- 
CPU memory limitation by distributing the computation over a 
number of nodes in a CPU-based cluster supercomputer. 

Alternatives to the hierarchical method include the use of 



polygonal cells (iMediavilla et all 2006), optimised to map ar- 
eas of the image plane to the source plane thus reducing N rdys 
required for a given accuracy, and the con touring method for 
point sources OLewis et al.lll9 93: Witt 1993), which maps an in- 
finite line plus N t closed loops from the image to source plane. 
The polygonal cell approach results in its own computational 
overhead in determining the correct polygonal lattice, and suf- 
fers from the need for a substantial number of polygons in the 
vicinity of caustics, which results in lengthy computation times. 
The cont ouring method has limite d applicability to extended 



Wvithe & Webster! ( 1 19991) presented an extension of 



sources. 

the contouring method to cover extended sources, however the 

additional computations required slowed the code to the point 

where it could no longer compete with the simplicity of inverse 

ray-shooting. 

A recent alternative is the appearance of the GPU as 
a mathematical co-processor. Originally developed to 
enhance the rate of generating graphics for the com- 
puter game industry, GPUs are rev ol utionising sc i entific 
computing ( Fournier & Fusselll 1988 : iTomovetalJ 12003c 



I — II II II L 

Venkatasubramanian 2003; Owens et al. 2005). Their highly 

parallel processing architecture, accessed through flexibile pro- 
gramming libraries such as NVIDIA's CUDAlI or the cross- 
platform OpenClJj standard created by the Khronos Group, is 
enabling speed-ups of 0( 1 0) to 0( 1 00) times over a broad range 
of algorithms. Indeed, the notion of general purpose computing 
on graphics processing units (GPGPU) has motivated a reinves- 
tigation of time-consuming processing tasks. 

Early adoption of G PGPU in astronomy has included N- 
body simulations (e.g., Belleman et al. 2008), radio-telescope 
signal correlation (e .g.. iHarris et alj 120081) . the solution of 



rd, 
39) 



Kep ler's equations dFordi 120081) . radiation-transfer models 



( Jonsson & Primack, 2009) and adaptive mesh refinement sim- 
ulations (iSchive et all 12010). 

From a conceptual and implementational point of view, direct 
ray-shooting is a very simple algorithm with a high degree of 
parallelism: the deflection of each light ray is independent of 
every other light ray, and the deflection of a light ray due to 
a single lens is independent of all other lenses. The existence 
of such fine-grained parallelism means that ray-shooting can 
be considered an "embarassingly parallel" algorithm - ideal for 
G PGPU. 

Thompson et al.l (120 10» reported on a CUDA implementation 



of direct ray-shooting, and demonstrated that their code could 
calculate a magnification map for the Ofidudai scenario in ~ 7 



1 Relevant for the case cr ~ 1 



http : //www . nvidia . com/obj ect /cuda_home .html| 



http : //www . khronos . org/opencl 



hours using a four-GPU NVIDIA S1070 Tesla unit. Moreover, 
they were able to calculate a magnification map with one billion 
lenses (N p \ x = 512 2 and N lays - 42) in 24 hours, with a peak 
sustained processing rate of 1.28 Tflop/sec. While the CUDA 
implementation of direct ray-shooting gave an (9(100) process- 
ing speed-up compared to using the same method on a sin 



gle CP U, a critical comparison omitted from Thomps on et al 



(120 101) was between the tree code and the GPU code. 
In this paper, we compare: 

• The "industry standard" single-core tree code by Wambs- 
ganss (1990; 1999), which we refer to as CPU-T; 



The parallel, large-data tree code by iGarsden & Lewis 
(2010), which we refer to as CPU-P; and 



The C UDA direct ray-shooting code by iThompson et al 
d2010l) . which we refer to as GPU-D. 



We now describe salient points of these three approaches in 
more detail, including implementation issues that are relevant 
for the code comparison in Section [3] 

2.1. CPU-T 

The most computationally intensive part of the ray-shooting 
algorithm is the calculation of the deflection of individual rays 
by each lens. The hierarchical tree code attempts to min- 
imise the number of calculations necessary by grouping to- 
gether lenses that are sufficiently distant from a given light ray. 
A group of lenses is treated as a single pseudo-lens with a total 
mass equal to the sum of the individual lens masses, located at 
their centre of mass. 

The first step in the calculation is to divide up the lenses into 
a cell structure, starting with the root cell, which is the size of 
the lens plane. Each cell that contains more than one lens is 
subdivided into four cells with half the side length of the parent 
cell. This process is continued iteratively until only cells con- 
taining zero or one lens remain. This calculation need only be 
carried out once per magnification map, and typically takes only 
a small fraction of the code's total run-time. A determination 
must be made for each light ray regarding which lenses need 
to be included individually and which can be grouped together 
into pseudo-lenses. The lens equation then becomes: 



y = 



l-r 
o 



\ v (x-x,) v PL 

-, \x-cr c x- > mi -— > m, 

1+7 ) ^ |x-x,f j-j J 



(x-x^) 



(7) 
where Nl is the number of lenses to be included directly, and 
Npl is the number of pseudo-lenses with total mass m PL and 
centre of mass position x^ L . We note that the actual imple- 
mentation of the tree code uses higher order multipoles of the 
mass distribution in order to increase accuracy. This calculation 
scales on average as <9(logJV*). 

Additional techniques are used in the standard hierarchical 
tree code in order to improve the accuracy and speed of the cal- 
culation. For example, "test rays" are used to determine the 
cell structure, and then the same cells are used for a number of 
actual rays surrounding the test ray. Also, the deflection due 



to pseudo-lenses is only calculated directly for the test rays; 
for the actual rays, pseudo-lens deflections are interpolated be- 
tween those calcula ted for the test rays. For full details, see 
Wambsganssl(ll999l) . 



2.2. CPU-P 

A tree code requires the following data structures to be stored 
in memory: 2-D magnification map array; lens array (mass and 
2-D location for each lens); and the cell tree (stored as an ar- 
ray of indices), which provides a significant memory overhead 
as N, increases. In order to increase the scale of microlens- 
ing simulations beyond those that were achievable with CPU- 
T, where typical single-core memo r y perm its a maximum of 
20 million lenses. IGarsden & Lewisl (120101) implemented a dis- 
tributed, parallel version of the tree code (CPU-P). They take 
advantage of the inherent parallelism of the ray deflections by 
distributing rays between multiprocessors of a CPU-based clus- 
ter supercomputer, using a simple but effective approach to 
static load balancing between computing nodes. To handle the 
large memory requirements for billion-lens simulations (~ 200 
GB), they use a combination of local memory, distribution of 
data between nodes (thus sharing the memory requirements), 
and disk-based storage (although this limits processing speeds 
somewhat by the requirement of costly file accesss). 

The CPU-P code implements the same cell-tree and mi- 
crolensing algorithms as CPU-T, but surrounded by a quite dif- 
ferent environment within a running simulation. The generation 
of lens masses at the beginning of the run, and the ray-shooting 
itself, use as much parallelism as is available. The generation 
of the cell tree has not been parallelised, but is executed con- 
currently with another phase in the simulation. All these oper- 
ations are logically the same as in the CPU-T version but exe- 
cute without knowledge that they are running concurrently, or 
that their data may be contained within huge disk files. Disk 
files naturally degrade performance and parallel speed-up, but 
there are various caching schemes which can make up for that. 
CPU-P has been built to have the same accuracy as CPU-T, 
because the same numerical algorithms are used, however due 
to the reimplementation in a different programming language 
(FORTRAN for CPU-T, C++ for CPU-P), with different ran- 
dom number generators, and a different precision of integers 
(arbitrary precision has been implemented), there will naturally 
be minor differences. 

2.3. GPU-D 

The GPU parallel implementation of direct ray-shooting is 
achieved by splitting the total number of light rays into Nb 
batches, which are deflected in parallel. For convenience, the 
deflection calculation of equation <j4j is recast as: 



l-r 
o 



o 

l+y 



: + cr c x-y = .ZI]m,-— 



(x - x,) 



X,l' 



(8) 



where £ /=1 N, = N, for Nj parallel threads, and each process- 
ing step considers only a single light ray. The contributions 
from y and <x c are determined at the end of the deflection cal- 
culation. 



The GPU-D code, described in Thompson et al. (2010), was 
implemented in CUDA according to the following algorithm: 

1 . N, lens coordinates are obtained (see below) and stored in 
CPU memory. Coordinates are loaded into GPU device 
memory as required; 

2. A^b = 2 17 light ray coordinates are pseudo-randomly gen- 
erated on the CPU and loaded into GPU device memory; 

3. GPU computation is initialised; computation is split into 
groups of A^j = 128 threads; 

4. Each thread group loads Nj lenses and calculates deflec- 
tion on Nt rays; this is repeated until all lenses and rays 
are exhausted; 

5. Once computation on the GPU is complete, results are 
copied back from the GPU device memory to system 
memory; 

6. CPU maps the ray locations onto the source pixel grid in 
order to obtain the magnification map; 

7. Steps 2-6 repeated until an average of Af rays rays per source 
pixel is reached. 

The only change in approach from that of Thompson et al. 
(2010) is the way the N* lenses coordinates are generated: for 
this work, we first run a tree code and output lens positions to 
a data file. These coordinates are then read from the data file 
by the GPU-D code, and subsequent processing is carried out 
as described above. Without this modification, we would not 
be able to make pixel-by-pixel comparisons of the magnifica- 
tion maps, and would be left with just statistical comparisons 
based on, for example, magnification probability distributions. 
For full details on choice of optimisation of GP U parameters, 
especially Nb and A^t, see lThompson et al.1 (I2010I) . as these are 
somewhat hardware dependent. 



3. Comparing the methods 

In this section, we consider the relative accuracies of CPU-T, 
CPU-P and GPU-D over a much wider range of parameter space 
than has been considered previously, and we compare process- 
ing times between CPU-T and GPU-D, noting regions of pa- 
rameter space where each code might currently be preferred. 
The former aspect is important in assessing the validity of all 
microlensing results that have relied on the tree code to date, 
the latter in motivating a choice of code for future applications. 

The choice of model parameters for comparison is somewhat 
arbitrary. We attempt to cover the majority of cr and y param- 
eter space typically probed by realistic lens models, as well as 
typical map si zes and pixel scales used in those analyses (see 
for example: IWitt et al.lll995b lKochanek||2004t iMorgan et aT 
2006; Bat e et alJ l2008: MediavillaeTaD|2009t). 



3.1. Accuracy 

Since the tree code uses an approximation to the lens loca- 
tions, in some sense the direct approach produces a magnifi- 
cation map that is more accurate for a given configuration of 



lenses. An unanswered question is whether the hierarchical ap- 
proach maintains sufficient accuracy as the number of calcula- 
tions, <t>, is increased beyond approximately 2 x 10 14 . This re- 
striction is largely historical, due to the impracticality of using 
the direct ray-shooting approach on a single-core CPU beyond 
this limit. 

We conducted simulations for a range of parameters, sum- 
marised below: 

• N pix = 1024 2 , 2048 2 and 4096 2 ; 

• N mys = 100, 200, 500, 1000 and 2000; and 

• 0.1 < cr < 0.9 in increments of Act = 0.1, with y = 0.0 
and cr c = 0.0. This corresponds to N, = 65, 163, 313, 558, 
987, 1817, 3701, 9343 and 41257. 

• 0.0 < y < 1.0 in increments of Ay = 0.1, with cr - cr, - 
0.450 and cr c = 0.0. This corresponds to N t = 741, 814, 
1093, 1875, 4796, 41134, 41119, 4748, 1787, 956 and 
606. 



• 0.0 < s < 0.9 in increments of As 
where the smooth matter fraction s 
and y = 0.5. 



0.1 and s = 0.99, 
crjcr, cr = 0.450 



These parameter choices allow us to probe the range: 



6.8 xlO 10 <<t>< 1.4 xlO 16 , 



(9) 



where we assume that A^flop = 10. We choose all lenses to 
have the same lens mass as this does not influence the run-time. 
The CPU-T/CPU-P accuracy parameter was set to 0.60 for all 
of our simulations, as this was found to reliably produce accept- 
able magnification maps. CPU-P maps were generated using a 
minimum of 2 compute nodes, to ensure that the parallelisation 
was not producing inaccurate results. 

There is a subtle difference between the way Af rays is defined 
for CPU-T/CPU-P and GPU-D. In both codes, the same num- 
ber of rays are shot across the entire shooting grid. The CPU- 
T/CPU-P rays are shot on a regular grid, thus in the absence of 
lensing exactly Af lays are received at each CPU-T/CPU-P pixel. 
The GPU-D rays are shot from random positions in the shoot- 
ing grid, and so on average N mys are received at each GPU-D 
pixel. This will introduce Poisson noise into the GPU-D magni- 
fication maps, however the total number of rays shot is so large 
(10 8 to 4 x 10 10 in our simulations) that any error introduced by 
shooting rays randomly will be insignificant. 

We use two methods for assessing accuracy: the root mean 
square difference between corresponding magnification maps, 
and a Kolmogorov-Smirnov (K-S) test on probability distribu- 
tions for change in magnitude Am. The K-S test results in a 
measurement of the significance with which we can reject the 
null hypothesis, that the two samples were drawn from the same 
distribution. Change in magnitude is calculated from the (per 
pixel) map magnifications using the following relationship: 



Am, 



-2.5log l0 (fiij). 



(10) 



Throughout, we define "excellent agreement" as a root mean 
square difference between maps of less than 0.2 magnitudes (of- 
ten much less than 0.2 magnitudes), and a K-S test probability 
that the null hypothesis is not rejected of greater than 99.7%. 

3.1.1. No shear 

To begin with, we set both the external shear, y, and the 
smooth matter fraction, s - cr c /cr, to zero. This allows us to 
consider just the effect of compact objects, which are directly 
related to the number of lenses via equation ©. Later, we allow 
both y and s to vary. 

As indicative results, we present the calculated RMS val- 
ues and K-S probabilities, P(KS), for the N pix = 4096 2 case 
in Table Q] For cr > 0.3, we find excellent agreement be- 
tween the magnification maps from CPU-T and GPU-D when 
JV ra y S > 500. The <j = 0.1 and 0.2 cases differ slightly. Here, a 
low number of lenses {N* - 65 and 163 respectively) mean that 
there is very little variation in magnification across the maps. 
The K-S test probabilities are therefore strongly affected by 
small differences in magnification probability distributions. For 
JVpi x = 1024 2 and 2048 2 , N mys = 1000 are sufficient to obtain 
agreement between CPU-T and GPU-D. For A^ pix = 4096 2 even 
JV ra y S = 2000 are not enough. In other words, a larger number of 
rays per pixel are required for agreement between CPU-T and 
GPU-D when cr is low (< 0.2). 

We note also that for a given cr and N mys , the RMS error tends 
to increase with increasing N v \ x . For example, when cr - 0.4 
and N, 



rays 



500, we obtain the following RMS differences: 
0.039 mag (/V pix = 1024 2 ), 0.081 mag (N pix = 2048 2 ) and 0.168 
mag (Afpix = 4096 2 ). In other words, for a fixed accuracy we 
must increase N mys by a factor of ~ 2 if we increase N p \ x by a 
factor of 4. 

For the comparisons between CPU-P and GPU-D, we find 
the same results. In fact, the CPU-T and CPU-P maps are virtu- 
ally indistinguishable down to the accuracy of our comparison. 
They therefore behave identically when compared with GPU- 
D. 

3.1.2. Shear and convergence 

We then fixed the convergence at cr - 0.45, and examined 
the impact of varying shear over the range 0.0 < y < 1 .0 in 
increments of Ay = 0.1. The number of pixels was fixed at 
Afpj x = 2048 2 . Once again, the smooth matter fraction s was set 
to zero. We find very similar results to the no-shear case dis- 
cussed above: for Arrays > 500, the results of CPU-T and GPU-D 
are indistinguishable (RMS difference < 0.2 magnitudes and K- 
S probability > 99.7%), and similarly for comparisons between 
CPU-P and GPU-D. 

3.1.3. Smooth matter fraction 

Finally, we test the accuracy of CPU-T and CPU-P when the 
convergence is split into a continuously distributed component 
cr c and a compact stellar component <x„. We set y - 0.5 and 
cr — cr c + cr* - 0.45, and vary s in the range 0.0 < s < 0.9 
in increments of A s = 0.1. Since there is no microlensing if 
s - 1, we also simulated the s - 0.99 case. For convenience, 



Table 1: Comparison between the CPU-T, CPU-P and GPU-D 
codes for magnification maps with N plx = 4096 2 , y = 0.0 and 
s - 0.0 (corresponding to cr c = 0.0). RMS is the root mean 
square difference between magnification maps (compared pixel 
by pixel), and P(KS) is the K-S test probability. The CPU-T 
and CPU-P comparisons are virtually indistinguishable. 





CPU-T and GPU-D 


CPU-P and GPU-D 


Arrays CT 


RMS 


P(KS) 


RMS 


P(KS) 


100 0.1 


0.782 


0.247 


0.782 


0.247 


0.2 


0.757 


0.389 


0.757 


0.389 


0.3 


0.734 


0.000 


0.733 


0.000 


0.4 


0.712 


0.004 


0.712 


0.000 


0.5 


0.731 


0.802 


0.731 


0.802 


0.6 


0.730 


0.277 


0.730 


0.277 


0.7 


0.736 


1.000 


0.736 


1.000 


0.8 


0.799 


1.000 


0.799 


1.000 


0.9 


0.813 


1.000 


0.813 


1.000 


200 0.1 


0.390 


0.010 


0.390 


0.010 


0.2 


0.415 


0.413 


0.415 


0.413 


0.3 


0.442 


0.473 


0.442 


0.473 


0.4 


0.495 


0.835 


0.494 


0.835 


0.5 


0.490 


1.000 


0.490 


1.000 


0.6 


0.536 


0.856 


0.536 


0.856 


0.7 


0.572 


1.000 


0.572 


1.000 


0.8 


0.559 


1.000 


0.559 


1.000 


0.9 


0.601 


1.000 


0.601 


1.000 


500 0.1 


0.143 


0.015 


0.143 


0.015 


0.2 


0.150 


0.239 


0.150 


0.239 


0.3 


0.154 


0.997 


0.154 


0.997 


0.4 


0.168 


0.999 


0.168 


0.999 


0.5 


0.166 


1.000 


0.166 


1.000 


0.6 


0.185 


1.000 


0.185 


1.000 


0.7 


0.206 


1.000 


0.206 


1.000 


0.8 


0.191 


1.000 


0.191 


1.000 


0.9 


0.212 


1.000 


0.212 


1.000 


1000 0.1 


0.075 


0.039 


0.075 


0.039 


0.2 


0.071 


0.774 


0.071 


0.774 


0.3 


0.074 


1.000 


0.074 


1.000 


0.4 


0.081 


1.000 


0.081 


1.000 


0.5 


0.079 


1.000 


0.079 


1.000 


0.6 


0.087 


1.000 


0.087 


1.000 


0.7 


0.096 


1.000 


0.096 


1.000 


0.8 


0.090 


1.000 


0.089 


1.000 


0.9 


0.099 


1.000 


0.099 


1.000 


2000 0.1 


0.042 


0.822 


0.042 


1.000 


0.2 


0.035 


1.000 


0.035 


1.000 


0.3 


0.037 


1.000 


0.037 


1.000 


0.4 


0.040 


1.000 


0.039 


1.000 


0.5 


0.039 


1.000 


0.039 


1.000 


0.6 


0.043 


1.000 


0.043 


1.000 


0.7 


0.047 


1.000 


0.047 


1.000 


0.8 


0.044 


1.000 


0.044 


1.000 


0.9 


0.050 


1.000 


0.050 


1.000 



we choose N* 



pix 



2048 2 



For s < 0.8, we find the same results as the two cases dis- 
cussed above: for N mys > 500 the results of CPU-T and GPU-D 
are in excellent agreement. The s = 0.9 case has a slightly 
higher RMS difference at A^ rays = 500 of 0.4, although the K-S 
probability is still > 99.7%. For the s - 0.99 case, however, no 
agreement is reached between CPU-T and GPU-D. Visual in- 
spection of the maps shows that too few rays have been shot in 
the CPU-T case. This suggests that more rays should be shot in 
CPU-T for higher smooth matter fractions; A^ ra y S = 1000 should 
be sufficient for s = 0.90, however more than A^ays > 2000 is 
required in the s = 0.99 case. Once again, CPU-P performs 
identically to CPU-T. 

By comparing the accuracy of the two tree codes, which rely 
on a series of approximations to the lens distribution in order 
to speed up calculation time, with GPU-D, which conducts the 
entire brute-force calculation, we have verified the accuracy of 
the standard tool for high optical depth microlensing simula- 
tions. Indeed, we have considered a wider range of parame- 
ters, as represented through the quantity O, than previous work 
($>2x 10 14 ). While not unexpected, our results confirm that 
the tree code can indeed be used with confidence, provided a 
sufficiently large number of rays per pixel are shot (i.e. typi- 
cally Mays > 1000 for the parameter space explored here). 

3.2. Timing Tests 

Having established accuracy of the three codes under consid- 
eration, we now turn to timing tests. Specifically, we determine 
the time taken to generate magnification maps with a given set 
of model parameters, allowing us to determine the regions of 
(y, cr, ./Vpix) parameter space where one of the implementations 
may be favoured. 

We do not consider timing comparisons between CPU-T and 
CPU-P, as these are reported in Garsden & Lewis (2010). Since 
CPU-P run-times scale with the number of nodes available, it is 
difficult to establish which configuration provides the most ap- 
propriate comparison between CPU-P and GPU-D. Using the 
hardware cost as a factor in choosing between software alterna- 
tives, we suggest that the cost of purchasing a suitable high-end 
graphics card plus computer is much lower than the set-up costs 
for a modest CPU-based computing cluster, once networking 
overheads and cluster management are considered. We reflect 
further on the implications this has for the future of microlens- 
ing computations in Section [4] 

Care must be taken when quoting absolute speeds for dif- 
ferent hardware - we use our results as a guide to the relative 
processing speeds of CPU-T and GPU-D, but note that alter- 
native hardware can change timing by factors of a few. Our 
benchmarks were conducted on the following equipment: 

• CPU-T Intel Xeon Dual Core 3060 CPU (2.4 GHz) with 
1 GB RAM. 

• GPU-D NVIDIA S1070 Tesla unit connected via dual 
PCIe x8, or higher, buses to an Intel Q6600 Quad Core 
CPU (2.4 GHz) with GB RAM. The Tesla SI 070 com- 
prises four GPUs, each with 240 stream processors, run- 
ning at 1.296 GHz. Each GPU is able to access 4 GB of 



RAM. The quoted peak processing rate is 2.488 TFLOP/s, 
however, for GPU-D, a peak of 1.28 TFLOP/s was ob- 
tained using all four GPUs (Thompson et al. 2010). We 
note that the majority of our timing benchmarks in the 
present work were conducted using only two of the four 
GPUs available on the Tesla S1O7C0. 

Since we have determined that the magnification maps pro- 
duced by CPU-T are accurate for iV ra y S > 1000 across the ma- 
jority of parameter space, we conduct our timing tests by com- 
paring the total time taken by CPU-T and GPU-D to generate 
identical maps with Af ra y S = 1000, for each combination of pa- 
rameters. The results of these tests are discussed below. 

3.2.1. No shear 

Timing results for N pix = 1024 2 , 2048 2 and 4096 2 , with y = 
and s — 0, are plotted in Figure [2] Each panel shows both <x 
(along the top axis), and the logarithm of 2V» (along the bottom 
axis). Note that as s - for these simulations, cr = cr*. 

The performance of CPU-T is essentially constant as cr in- 
creases. The same is not true for GPU-D; it performs better 
for lower cr (lower N t ) values. This result is as expected; as 
the number of lenses is increased, the direct code is required 
to perform correspondingly more calculations, whereas the ap- 
proximations in the tree code (specifically the depth to which 
the cell tree is accessed) mean it performs roughly the same 
number of calculations at each step. 

The shape of the GP U-D timing curves is th e same as those 
presented in Figure 2 of iThompson et al.1 (120101) . For low num- 
bers of lenses (A 7 * < 10 4 ), overheads induced by the need to 
transfer data between GPU and CPU affect the performance of 
GPU-D. For N t > 10 4 , near the limit of our comparison, the 
performance of the code scales linearly with N,. While this 
linear growth suggests that there is a cross-over in run-time be- 
tween CPU-T and GPU-D at around N„ - 10 5 , it must be re- 
called that this is approaching the maximum lens limit of CPU- 
T, due to the memory requirements for storing the cell tree. 

We note that although the performance of GPU-D is decreas- 
ing as cr (or equivalently, N*) is increased, it is still faster than 
CPU-T for all of our y - 0.0 simulations. Using all four GPUs 
on the Tesla S1070 unit, instead of the two we use for this de- 
tailed timing comparison, results in factor of two speed-up. 

3.2.2. Shear and convergence 

Real lensing galaxies are unlikely to be perfectly spherically 
symmetric, and so realistic lens models do not contain zero ex- 
ternal shear. We therefore also conducted timing tests for vary- 
ing y. Equation|4] shows that the actual calculation of the effect 
of external shear should not significantly add to computation 
time - it simply adds one extra operation to the calculation of 
the deflection of each ray. However, the side lengths L ± of the 
rectangular shooting grid that is projected on to a square mag- 
nification map with side length L map depends on the shear in the 
following way: 



4 This was a logistical issue, as there were other users who required access 
to the Tesla unit while our timing tests were underway. 
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Figure 2: Timing results for GPU-D (solid line) and CPU-T (dashed line) as a function of convergence, cr, or equivalently, number 
of lenses, N*. In all cases, shear y - 0.0, Mays = 1000 and s — (i.e. all of the convergence is in a compact stellar component, 
cr = cr,), Panels are A pix = 1024 2 (left); N pix = 2048 2 (middle); A pix = 4096 2 (right). 
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When l-cr-yw0orl-cr + ya;0, one side of the shooting 
grid can be very large. This leads to a large area for the lens 
plane, and hence a correspondingly large number of lenses (see 
Equation[5]and the discussion following it). 



We conducted timing tests for the case where A ra 



rays 



N, 



pix 



1000, 
2048 2 , s = and cr = cr, = 0.45, with < y < 1.0 
and Ay = 0.1. The results of these tests are presented in Figure 
[3] The figure shows y versus the logarithm of the time taken 
to generate the map. We also plot time against the theoretically 
predicted average magnification // av in that figure, where /z av is 
given by: 

1 



Mav = 



(12) 



(l-o-) 2 -y 2 

A negative magnification is interpreted as a parity flip in the 
image. 

For low magnifications (when 1 — cr — y and 1 - cr + y are 
very different from zero), GPU-D is faster than CPU-T. There 
is a crossover point at a magnification of ~ 5 where CPU-T 
begins to out-perform GPU-D. For high magnifications, CPU- 
T is significantly faster than GPU-D. We emphasise again that 
these timing tests were conducted using only two of the avail- 
able four GPUs. Running on all four GPUs should improve the 
performance of GPU-D by approximately a factor of two. 

The timing behaviour of GPU-D with the addition of a non- 
zero shear is exactly the same as the zero shear case. Initially, 
overheads dominate the code's performance, but as the number 
of lenses is increased to ~ 10 4 , the code begins to scale linearly 
with N*. 

The same is not true for CPU-T - the timing behaviour in the 
zero shear and varying shear cases seems to be quite different. 
This is due to the way in which the shooting grid is stretched 
by the addition of a non-zero shear. Equation QT| shows that 



theoretical magnification |/z av | 
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Figure 3: Timing results for GPU-D (solid line) and CPU-T 
(dashed line) as a function of shear y. The number of pixels 
Afpi x = 2048 2 , smooth matter fraction s - 0, and the conver- 
gence is cr - cr t — 0.45. An average of Af, 



rays 



1000 was 

shot per pixel. The top axis shows the theoretically predicated 
average magnification in the absence of microlensing. 



for certain combinations of cr and y the shooting grid will be 
significantly stretched in one direction. A ray shot from one 
corner of this stretched grid is close to fewer lenses than a ray 
shot from a corner of a square with the same area and thus a 
greater number of lenses can be grouped together into pseudo- 
lenses. Although N* goes up for these cases, the number of 
lenses that need to be included directly in the calculation goes 
down for the tree code. 

We expect this behaviour to be generic as cr is varied - 
CPU-T will be faster than GPU-D in the high magnification 
regime on current-generation hardware, whereas GPU-D will 
out-perform CPU-T in the low magnification regime. However, 
as we discuss in Section [4] the former situation may only be a 
limit in the short-term. 



log(N.) 



To p rovide a physical context for these results, the lKoch anek 
(2004) model for image A in the lensed quasar Q2237+0305 
has cr - 0.394 and y = 0.395, and a corresponding theoretical 
magnification ju av = 4.735. This is in line with most models for 
Q2237+0305 image A, which typically have cr ~ 0.4, y ~ 0.4 
and yu av ~ 5 (e.g. Wyithe, A gol & Fluke 2002). In compari- 
son, the lKeetonet a l. (2006) model for the high magnification 
image D in SDSS J0924+0219 has cr = 0.476, y = 0.565 and a 
magnification of yU av = -22.4. 

3.2.3. Smooth matter fraction 

Finally, we present the results of the timing tests for varying 
smooth matter fraction in Figure [4] Tests were conducted for 
JVpi x = 2048 2 , cr = 0.45, y = 0.5, and JV rays = 1000. This 
combination of cr and y corresponds to the "worst case" mag- 
nification for GPU-D timing, and the "best case" magnification 
for CPU-T timing (see the previous section). 

Increasing the smooth matter fraction for fixed cr corresponds 
to decreasing N„ as mass is transferred from the compact stel- 
lar component to the continuously distributed component. We 
therefore see exactly the timing behaviour we expect: CPU- 
T takes roughly the same amount of time for each s, whereas 
GPU-D is faster for fewer lenses. 

At the distances from the centre of lensing galaxies where 
lensed images are typically observed (~ 2 — 10 kpc), we ex- 
pect the smooth matter fraction to be quite high. Indeed, a few 
rough microlensing measurements of smooth matter percentage 
at lensed image positions exist: 80% to 85 % at the location o f 
the A and D images in SP SS J0924+0219 JKeeton et all[2006b : 
~ 80% in HE 1 104+080 JChartas et all l2009t) : ~ 90% in PG 
1 1 15+080 (IPoolev et alll2009l) : and ~ 70% in RXJ 1131-1231 
(IDai et all 120101) . For smooth matter fractions in this range, 
we find that GPU-D already performs comparably to CPU-T in 
high magnification systems, and out-performs it in low magni- 
fication systems. 

We note that our accuracy tests in Section |3 . 1 . 31 showed that 
even JV lays = 2000 were not sufficient for CPU-T to obtain re- 
sults indistinguishable from GPU-D at s - 0.99, at least with 
the CPU-T accuracy parameter used throughout this paper. The 
s - 0.99 point in the Figure |4]CPU-T results is therefore likely 
to be lower than the actual run-time required to produce a use- 
able map. 
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Figure 4: Timing results for GPU-D (solid line) and CPU-T 
(dashed line) as a function of smooth matter fraction s. The 
total convergence cr = 0.40, shear y = 0.5, the number of pix- 
els A^pi x = 2048 2 , and the average number of rays per pixel is 
N mys = 1000. 



4. Discussion 

Our intention in this work is not to imply that the tree code 
is excessively inaccurate. However, its reliance on an approx- 
imation to the lens distribution to reduce the number of cal- 
culations per magnification map warrants further investigation 
that has not been feasible until now. The standard tree code 
(Wambsganss 1990; 1999) has been in widespread use through- 
out the lensing community since the early 1990s. Though it is 
unsurprising, it is nevertheless reassuring to know that CPU- 
T remains accurate over an extended range of O-space. Our 
comparison here demonstrates that researchers can, and should, 
continue to use the tree code for microlensing simulations with 
confidence. Additionally, we have demonstrated that this ac- 
curacy is also maintained by the CPU-P code, which enables 
exploration of problems orders of magnitude beyond those that 
can be achieved on a single-core CPU. 

However, we stress that the direct inverse ray-shooting code 
is more accurate than its tree code counterparts by simple virtue 
that it includes contributions from all lenses without approxi- 
mations - the entire calculation is done by brute force, taking 
advantage of the massively parallel GPU architecture to com- 
plete the computation in a reasonable timeframe. While CPU-T 
can indeed be used for a wide range of astrophysically relevant 
scenarios, GPU-D is now a highly competitive alternative. 

Given that we have verified the accuracy of the tree codes, 
though, it is reasonable to ask in which areas of parameter space 
each code should be used. The larger the number of lenses, the 
better the tree codes perform relative to GPU-D. For the val- 
ues of JVpi x and N rdys we examined in our benchmarking, the 
crossover point where CPU-T begins to out-perform its GPU- 
based counterpart is N* ~ 50, 000 lenses. This can occur either 



when (cr - y) or (cr + y) are close to 1. For scenarios with 
5 there are advantages in moving to 



JV pix > 4096 2 and N* > 10 



a distributed CPU-based computing cluster, although this can 
be a financially costly approach. One of the perceived bene- 
fits of GPU computing is the heavily reduced dollar per Tflop/s 
they offer, which makes th em an attractive alterna tive as "desk- 
top" supercomputers (see iThompson et al.ll2010h . The down- 
side of GPUs, at this point in time, is the paucity of astronomy 
codes that have been adapted to run on them, whereas CPU- 
based computing clusters can support a much wider range of 
existing codes. 

Since about 2005, single-core CPU processing speeds have 
plateaued. Additional computing performance has been 
achieved, and indeed Moore's Law growth in processing power 
has been maintained, by moving to multi-core CPUs. A grow- 
ing range of applications across a variety of scientific disci- 
plines are now taking advantage of the "many-core" processing 
architecture of GPUs, which are providing a preview of the fu- 
ture of CPU architectures. Yet if GPUs do indeed point towards 
a future where many-core processing is mainstream, the need 
to convert existing trusted, tested, accepted codes will become 
more pressing. A question that many researchers will have to 
answer is which code should be adapted? 

We suggest that decisions to move other standard astronomy 
computations to GPU may benefit from a similar approach to 
Thompson et al. (2010). Rather than attempting to directly 
port an existing advanced code (i.e. CPU-T) to GPU, they con- 
sidered a brute force approach that was not feasible for single- 
core CPU, but which exhibited the type of fine-grained, mas- 
sive parallelism that is ideal for GPGPU. This resulted in much 
faster code development time, despite the learning curve as- 
sociated with adopting CUDA and the specifics of GPU par- 
allel programming. Moreover, we have shown that in most 
astrophysically-motivated cases, we are already no worse off 
using GPU-D compared with CPU-T, with the added benefit of 
greater intrinsic accuracy. 

The ^.-crossover in timing tests that we have identified does 
depend on the hardwared used. For example, our benchmark- 
ing was performed using two of the four GPUs available on the 
Tesla S1070, while the CPUs used for CPU-T were a few years 
old. The key point, however, is that the GPU-D is already par- 
allel - GPU performance will increase, and more GPU cores 
will be placed in individual machines, yet no further code de- 
velopment is required. The single-core tree code, however, is 
more heavily restricted - CPU-T can no longer make signif- 
icant processing gains by simply relying on Moore's Law to 
make single-core CPU hardware faster. 

5. Conclusion 

In this work, we have examined the accuracy of three highly 
competive alternatives for generating microlensing magnifica- 
tion maps: a widely-used hierachical tree code for single-core 
CPUs; a parallel, large data implementation for a distributed 
CPU-based computing cluster; and a brute force solution that 
utilises the massive parallelism of modern graphics processing 
units. We have demonstrated that these three codes do indeed 



present comparable accuracy, and choice of approach will de- 
pend on considerations relating to the scale and nature of the 
problem to be investigated. 

While considered impractical as a calculating approach even 
a few years ago, our timing tests show that direct, inverse ray- 
shooting is now a viable, and indeed more accurate, option. 
Moving a simple algorithm to GPU, rather than attempting to 
directly implement a more advanced solution - originally devel- 
oped to overcome a hardware restriction of single-core CPUs - 
to achieve run-times that are already similar, suggests an ap- 
proach that other early adopters of GPUs for astronomy com- 
putation may wish to consider. 
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